Abstract

We present a joint copula-based model for insurance claims and sizes. It uses 
bivariate copulae to accommodate the dependence between these quantities. We
derive the general distribution of the policy loss without the restrictive assumption of 
independence. We illustrate that this distribution tends to be skewed and multi-modal, 
and that an independence assumption can lead to substantial bias in the estimation of 
the policy loss. Further, we extend our framework to regression models by combining 
marginal generalized linear models with a copula. We show that this approach leads 
to a flexible class of models, and that the parameters can be estimated efficiently using 
maximum-likelihood. We propose a test procedure for the selection of the optimal 
copula family. The usefulness of our approach is illustrated in a simulation study and 
in an analysis of car insurance policies. 

Keywords: dependence modeling; generalized linear model; number of claims; claim 
size; policy loss 



1 Introduction 

Estimating the total loss of an insurance portfolio is crucial for many actuarial decisions,
e.g. for pricing of insurance contracts and for the calculation of premiums. A very common
approach, based on the compound model by Lundberg (1903), models the average claim
size and the number of claims independently, and then defines the loss as the product of
these two quantities. However, the assumption of independence can be too restrictive and
lead to a systematic over- or under-estimation of the policy loss. Evidently, this affects the
accuracy of the estimation of the portfolio loss.

We therefore propose a joint model that explicitly allows a dependency between average
claim sizes and number of claims. This is achieved by combining marginal distributions 
for claim frequency and severity with families of bivariate copulae. A main contribution 
of this paper is the derivation of the distribution of the loss of an insurance policy. We 
illustrate that the distribution is often very skewed, and that - depending on the model 
parameters - the distribution is multi-modal. Based on this distribution, we can estimate 
the expected policy loss and its quantiles. We show that the distribution, and in particular 
its mean, depends strongly on the degree of dependence. This underpins the usefulness of 
our copula-based model. 

Dependence modeling using copulae has become very popular in recent years (see the
standard reference books by Joe (1997) and Nelsen (2006)) and was introduced to actuarial
mathematics by Frees and Valdez (1998). Since then, copulae have been used, e.g., for the
modeling of bivariate loss distributions by Klugman and Parsa (1999) and of dependencies
between loss triangles by De Jong (2009), as well as for risk management (see McNeil et al.
(2005)).

It is common practice to model average claim sizes and number of claims in terms of
a set of covariates such as gender or age (see, e.g., Haberman and Renshaw (1996) for an
overview). Typically, however, claim frequency and severity are modeled separately under
the independence assumption of Lundberg (1903). Gschlößl and Czado (2007) therefore
included the number of claims as a covariate in the model for average claim size. To allow
for more flexibility and generality in the type of dependence, we extend our copula-based
model to regression models by combining generalized linear models for the two marginal
regression models with copula families. This is an extension of a recent approach by Czado
et al. (2011) and De Leon and Wu (2011), who only consider a Gauss copula based on work
by Song (2000, 2007) and Song et al. (2009).

In our general copula-based regression approach, the model parameters can be esti- 
mated efficiently using maximum-likelihood techniques. Further, we provide asymptotic 
confidence intervals that allow us to quantify the uncertainty of our estimates. For the 
selection of the copula family, we propose the likelihood-ratio test by Vuong (1989). 

In an extensive simulation study, we show that the incorporation of a copula allows 
a more precise estimation of the individual policy losses, which in turn leads to a more 
reliable estimation of the total loss. These results are confirmed in a case study on car 
insurance policies. 



2 Bivariate copulae for continuous-discrete data 
2.1 Background: bivariate copulae 

A bivariate copula $C : [0,1] \times [0,1] \to [0,1]$ is a bivariate cumulative distribution
function on $[0,1] \times [0,1]$ with uniformly distributed margins. The importance of copulae
is underpinned by Sklar's Theorem (1959). In the bivariate case, it states that for every joint
distribution function $F_{X,Y}$ of a bivariate random variable $(X, Y)$ with univariate marginal
distribution functions $F_X$ and $F_Y$, there exists a bivariate copula $C$ such that

$$F_{X,Y}(x,y) = C(F_X(x), F_Y(y)). \tag{1}$$

If $X$ and $Y$ are continuous random variables, the copula $C$ is unique. Conversely, if $C$ is a
copula, Equation (1) defines a bivariate distribution with marginal distribution functions
$F_X$ and $F_Y$. This allows us to model the marginal distributions and the joint dependence
separately, as we can define the copula $C$ independently of the marginal distributions.

Copulae are invariant under monotone transformations of the marginal distributions.
Therefore, instead of the correlation coefficient - which measures linear associations -
monotone association measures are used. A very common choice is Kendall's $\tau$,

$$\tau = 4 \int_{[0,1]^2} C(u,v) \, dC(u,v) - 1 \in [-1,1].$$

In this paper, we study copula-based models for a pair of continuous-discrete random
variables. We denote the continuous random variable by $X$, and the discrete random
variable by $Y$. We assume that $Y$ takes values in $\{1, 2, \ldots\}$. Their joint distribution is
defined by a parametric copula $C(\cdot,\cdot|\theta)$ that depends on a parameter $\theta$, i.e. the joint
distribution is given by

$$F_{X,Y}(x,y|\theta) = C(F_X(x), F_Y(y)|\theta).$$

We focus on four families of parametric bivariate copulae, namely the Clayton, Gumbel,
Frank and Gauss copulae. Each family depends on a copula parameter $\theta$. These parameters
can be expressed in terms of Kendall's $\tau$. The definitions of the copula families and their
relationship to Kendall's $\tau$ are provided in Appendix A. We note that the Clayton copula is
only defined for positive values of Kendall's $\tau$, and that the Gumbel copula is only defined
for non-negative values of Kendall's $\tau$. Via a rotation, it is however possible to extend these
copula families to negative values of $\tau$. An overview of bivariate copula families and their
properties, in particular their different tail behavior, can be found e.g. in Brechmann and
Schepsmeier (2012).
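
The mapping between Kendall's $\tau$ and the copula parameter can be sketched in a few lines.
The closed forms below are the standard textbook relations for the Gauss, Clayton and Gumbel
families; the function name `theta_from_tau` is a hypothetical helper introduced here for
illustration. The Frank family is left out because its relation involves a Debye function
and has to be inverted numerically.

```python
import math


def theta_from_tau(family: str, tau: float) -> float:
    """Map Kendall's tau to the copula parameter (standard closed forms)."""
    if family == "gauss":
        # Gauss copula: tau = (2 / pi) * arcsin(theta)
        return math.sin(math.pi * tau / 2.0)
    if family == "clayton":
        # Clayton copula: tau = theta / (theta + 2), only for tau > 0
        if not 0.0 < tau < 1.0:
            raise ValueError("Clayton requires 0 < tau < 1 (rotate for tau < 0)")
        return 2.0 * tau / (1.0 - tau)
    if family == "gumbel":
        # Gumbel copula: tau = 1 - 1 / theta, only for 0 <= tau < 1
        if not 0.0 <= tau < 1.0:
            raise ValueError("Gumbel requires 0 <= tau < 1 (rotate for tau < 0)")
        return 1.0 / (1.0 - tau)
    raise ValueError(f"unsupported family: {family}")


for fam in ("gauss", "clayton", "gumbel"):
    print(fam, theta_from_tau(fam, 1.0 / 3.0))
```

The input checks mirror the restriction stated above: without a rotation, Clayton and Gumbel
only cover positive (respectively non-negative) values of $\tau$.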



For sampling, estimation and prediction, we need the joint density/probability mass
function of $X$ and $Y$ that is defined as

$$f_{X,Y}(x,y) := \frac{\partial}{\partial x} P(X \le x, Y = y). \tag{2}$$

In the remainder of this paper, we will refer to $f_{X,Y}$ as the joint density function of $X$ and $Y$.

We now derive formulas for the joint density of $X$ and $Y$ in terms of the copula $C(\cdot,\cdot|\theta)$.
We denote by

$$D_1(u,v|\theta) := \frac{\partial}{\partial u} C(u,v|\theta) \tag{3}$$

for $u, v \in \, ]0,1[$ the partial derivative of the copula with respect to the first variable. Note
that this is the conditional distribution function of the random variable $V := F_Y(Y)$ given
$U := F_X(X)$ (Joe, 1997). A table in Appendix A shows the partial derivative (3) of the Clayton,
Gumbel, Frank and Gauss copula (see e.g. Aas et al. (2009) or Schepsmeier and Stöber (2012)
for more details).

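Proposition 1 (Density function). The joint density function $f_{X,Y}$ of a continuous random
variable $X$ and a discrete random variable $Y$ is given by

$$f_{X,Y}(x,y|\theta) = f_X(x) \left( D_1(F_X(x), F_Y(y)|\theta) - D_1(F_X(x), F_Y(y-1)|\theta) \right). \tag{4}$$

Proof. By definition

$$\begin{aligned}
f_{X,Y}(x,y|\theta) &= \frac{\partial}{\partial x} P(X \le x, Y = y) \\
&= \frac{\partial}{\partial x} P(X \le x, Y \le y) - \frac{\partial}{\partial x} P(X \le x, Y \le y-1) \\
&= \frac{\partial}{\partial x} C(F_X(x), F_Y(y)|\theta) - \frac{\partial}{\partial x} C(F_X(x), F_Y(y-1)|\theta) \\
&= f_X(x) \left( D_1(F_X(x), F_Y(y)|\theta) - D_1(F_X(x), F_Y(y-1)|\theta) \right),
\end{aligned}$$

which proves the statement. □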



2.2 Marginal distributions 

The framework presented in the preceding subsection holds for general pairs of
continuous-discrete random variables. In this paper, we focus on joint models for a positive
average claim size $X$ and a positive number of claims $Y$. We model the average claim size $X$
via a Gamma distribution

$$f_X(x|\mu,\delta) = \frac{1}{x \, \Gamma(1/\delta)} \left( \frac{x}{\mu\delta} \right)^{1/\delta} \exp\left( -\frac{x}{\mu\delta} \right) \quad \text{for } x > 0, \tag{5}$$

with mean parameter $\mu > 0$ and dispersion parameter $\delta > 0$. The number of claims $Y$ is
a positive count variable, and is modeled as a zero-truncated Poisson (ZTP) distributed
variable with parameter $\lambda > 0$,

$$f_Y(y|\lambda) = \frac{\lambda^y}{y! \, (1 - \exp(-\lambda))} \exp(-\lambda) \quad \text{for } y = 1, 2, \ldots.$$

The generality of our approach makes it easy to use other appropriate distributions such
as the log-normal for claim severity or the (zero-truncated) Negative Binomial for claim
frequency. The models and results presented below can be extended accordingly.

                average claim size X         number of claims Y
distribution    Gamma                        zero-truncated Poisson
parameter(s)    $\mu > 0$, $\delta > 0$      $\lambda > 0$
expectation     $E(X) = \mu$                 $E(Y) = \lambda / (1 - e^{-\lambda})$
variance        $Var(X) = \mu^2 \delta$      $Var(Y) = \lambda (1 - e^{-\lambda}(\lambda + 1)) / (1 - e^{-\lambda})^2$

copula family   Gauss, Clayton, Gumbel, Frank
parameter       $\theta \in \Theta$

Table 1: Model parameters of the joint distribution for average claim sizes X and number
of claims Y. The definition of the copula families is provided in Appendix A.

2.3 Joint copula model for average claim sizes and number of claims 

Combining the marginal distributions and the copula approach, we obtain the following 
general model. 

Definition 2 (Joint copula model for average claim sizes and number of claims). The
copula-based Gamma and zero-truncated Poisson model for positive average claim sizes $X$
and positive number of claims $Y$ is defined by the joint density function

$$f_{X,Y}(x,y|\mu,\delta,\lambda,\theta) = f_X(x|\mu,\delta) \left( D_1(F_X(x|\mu,\delta), F_Y(y|\lambda)|\theta) - D_1(F_X(x|\mu,\delta), F_Y(y-1|\lambda)|\theta) \right) \tag{6}$$

for $x > 0$ and $y = 1, 2, \ldots$.

The model depends on four parameters: the parameters $\mu$, $\delta$ (Gamma) and $\lambda$ (ZTP) for
the marginal distributions, and the copula parameter $\theta$. Table 1 displays the parameters
and their relationships to the joint distribution.
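
As an illustration, the joint density of Definition 2 can be evaluated numerically. The
sketch below assumes the Gauss copula, for which the partial derivative is
$D_1(u,v|\theta) = \Phi\big((\Phi^{-1}(v) - \theta\,\Phi^{-1}(u)) / \sqrt{1-\theta^2}\big)$, and relies on
SciPy for the marginal distributions. All function names are hypothetical helpers, not part of
the original work.

```python
import numpy as np
from scipy import stats


def gamma_pdf(x, mu, delta):
    # Gamma density with mean mu and dispersion delta (shape 1/delta, scale mu*delta)
    return stats.gamma.pdf(x, a=1.0 / delta, scale=mu * delta)


def gamma_cdf(x, mu, delta):
    return stats.gamma.cdf(x, a=1.0 / delta, scale=mu * delta)


def ztp_cdf(y, lam):
    # Zero-truncated Poisson distribution function; equals 0 for y < 1
    y = np.asarray(y, dtype=float)
    p0 = np.exp(-lam)
    cdf = (stats.poisson.cdf(y, lam) - p0) / (1.0 - p0)
    return np.where(y < 1, 0.0, np.clip(cdf, 0.0, 1.0))


def d1_gauss(u, v, rho):
    # Partial derivative of the Gauss copula with respect to its first argument
    u = np.clip(u, 1e-12, 1.0 - 1e-12)  # keeps the normal quantile of u finite
    z = (stats.norm.ppf(v) - rho * stats.norm.ppf(u)) / np.sqrt(1.0 - rho**2)
    return stats.norm.cdf(z)


def joint_density(x, y, mu, delta, lam, rho):
    # Joint density of (X, Y): marginal Gamma density times a difference of D_1 terms
    u = gamma_cdf(x, mu, delta)
    upper = d1_gauss(u, ztp_cdf(y, lam), rho)
    lower = d1_gauss(u, ztp_cdf(y - 1, lam), rho)
    return gamma_pdf(x, mu, delta) * (upper - lower)


print(joint_density(1200.0, 2, mu=1000.0, delta=0.09, lam=2.5, rho=0.5))
```

For `rho = 0` the difference of the two $D_1$ terms reduces to the ZTP probability mass, so the
joint density factorizes into the product of the two marginals, which is a convenient sanity
check.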

We now illustrate the influence of the copula parameters and families on the conditional
distribution of $Y$. To this end, we use the following result.

Proposition 3. The conditional distribution $Y|X = x$ of the number of claims given an
average claim size of $x$ under the copula-based model of Definition 2 is given by

$$P(Y = y|X = x, \mu,\delta,\lambda,\theta) = D_1(F_X(x|\mu,\delta), F_Y(y|\lambda)|\theta) - D_1(F_X(x|\mu,\delta), F_Y(y-1|\lambda)|\theta). \tag{7}$$

Proof. This result follows from Proposition 1, as by definition for two random variables $X$
and $Y$

$$P(Y = y|X = x) = \frac{f_{X,Y}(x,y)}{f_X(x)}.$$

□

Figure 1: Conditional probability mass function of the number of claims Y (horizontal axis:
number of claims). Marginal distributions: expected average claim size $\mu = 1000$ Euro with
dispersion parameter $\delta = 0.09$, expected number of claims $\lambda = 2.5$. We condition on an
average claim size of $x = 1200$ Euro. Left: Gauss copula with $\tau = 0; 0.1; 0.3; 0.5$. Right:
Gauss, Clayton, Gumbel and Frank copula with $\tau = 1/3$.



Example 4. We consider a group of policy holders with an expected number of claims of
$\lambda = 2.5$. The average claim size is set to $\mu = 1000$ Euro, and we assume that the standard
deviation of the average claim size equals 300 Euro, which leads to a dispersion parameter
of

$$\delta = \frac{300^2}{1000^2} = 0.09.$$

We condition on an average claim size of $x = 1200$ Euro.

The left panel in Figure 1 displays the conditional probability mass function (7) of
$Y|X = x$ for a Gauss copula with four different values of $\tau = 0; 0.1; 0.3; 0.5$. We observe
that the four probability mass functions are different, and that for higher values of $\tau$, more
mass is assigned to larger values of $y$. This is due to the dependence of $X$ and $Y$ and the fact
that the conditioning value $x = 1200$ Euro is much higher than the expected average claim
size of $\mu = 1000$ Euro. The right panel displays the conditional probability mass function
for $\tau = 1/3$ and the four different copula families. The choice of the copula family clearly
influences the conditional distribution. In particular, the upper-tail dependent Gumbel
copula shifts the distribution to the right compared to the other copulae. This leads to a
flexible class of dependence models between a discrete and a continuous variable.
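
The conditional probability mass function discussed in this example can be sketched
numerically. The block below assumes the Gauss copula with the standard relation
$\theta = \sin(\pi\tau/2)$ and uses the marginal settings of Example 4; the helper names and the
truncation point `y_max` are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

MU, DELTA, LAM, X_COND = 1000.0, 0.09, 2.5, 1200.0  # settings of Example 4


def ztp_cdf(y, lam):
    # Zero-truncated Poisson distribution function; equals 0 for y < 1
    y = np.asarray(y, dtype=float)
    p0 = np.exp(-lam)
    cdf = (stats.poisson.cdf(y, lam) - p0) / (1.0 - p0)
    return np.where(y < 1, 0.0, np.clip(cdf, 0.0, 1.0))


def conditional_pmf(x, tau, y_max=30):
    # P(Y = y | X = x) for y = 1, ..., y_max under a Gauss copula
    rho = np.sin(np.pi * tau / 2.0)
    y = np.arange(0, y_max + 1)
    a = stats.norm.ppf(stats.gamma.cdf(x, a=1.0 / DELTA, scale=MU * DELTA))
    z = (stats.norm.ppf(ztp_cdf(y, LAM)) - rho * a) / np.sqrt(1.0 - rho**2)
    cond_cdf = stats.norm.cdf(z)   # conditional distribution function at y = 0, 1, ...
    return np.diff(cond_cdf)       # successive differences give the probability masses


for tau in (0.0, 0.1, 0.3, 0.5):
    print(tau, np.round(conditional_pmf(X_COND, tau)[:6], 3))
```

For $\tau = 0$ the result coincides with the unconditional ZTP probabilities, which corresponds
to the independence case in the left panel of Figure 1.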

3 Policy loss estimation 

Next, we focus on the distribution of the policy loss. 

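Definition 5 (Policy loss). For a policy with average claim size $X$ and number of claims
$Y$, the policy loss is defined as the product of the two quantities,

$$L := X \cdot Y.$$

The policy loss is a positive, continuous random variable, and it depends on the four
model parameters displayed in Table 1. A main contribution of this paper is the following
result.

Theorem 6. The distribution of the policy loss $L$ is given by the density function

$$f_L(l|\mu,\delta,\lambda,\theta) = \sum_{y=1}^{\infty} \left[ D_1\left( F_X\left( \tfrac{l}{y} \big| \mu,\delta \right), F_Y(y|\lambda) \big| \theta \right) - D_1\left( F_X\left( \tfrac{l}{y} \big| \mu,\delta \right), F_Y(y-1|\lambda) \big| \theta \right) \right] \frac{1}{y} f_X\left( \tfrac{l}{y} \big| \mu,\delta \right)$$

for $l > 0$.

Proof. For simplicity of notation, we omit the model parameters from the formulas. We
consider the two-dimensional random variable

$$(L, Y)^\top \in \mathbb{R}_+ \times \{1, 2, \ldots\}$$

and derive its joint density/probability mass function. By definition (see Equation (2)),

$$f_{L,Y}(l,y) = \frac{\partial}{\partial l} P(L \le l, Y = y) = \frac{\partial}{\partial l} P\left( X \le \frac{l}{y}, Y = y \right),$$

as $X = L/Y$. Substituting $x = l/y$, we obtain

$$f_{L,Y}(l,y) = \frac{\partial}{\partial x} P(X \le x, Y = y) \cdot \frac{\partial x}{\partial l} = \frac{1}{y} f_{X,Y}\left( \frac{l}{y}, y \right).$$

The result then follows by marginalizing over the discrete random variable $Y$. □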

Figure 2: Densities of the policy loss for the four copula families (panels: Gauss, Clayton,
Gumbel and Frank copula) and three different values of Kendall's $\tau$. We use the parameter
settings of Example 4.

This implies that we can evaluate the density of the policy loss for all of our four copula 
models, given a fixed set of parameters. Further, we can evaluate its mean, variance and 
quantiles based on the density function. 
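
A minimal numerical sketch of this evaluation is given below. It assumes the Gauss copula,
truncates the infinite sum over the number of claims at a hypothetical `y_max`, and
approximates the integrals by a simple Riemann sum on an equidistant grid. The helper names
and grid settings are illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy import stats


def ztp_cdf(y, lam):
    # Zero-truncated Poisson distribution function; equals 0 for y < 1
    y = np.asarray(y, dtype=float)
    p0 = np.exp(-lam)
    cdf = (stats.poisson.cdf(y, lam) - p0) / (1.0 - p0)
    return np.where(y < 1, 0.0, np.clip(cdf, 0.0, 1.0))


def d1_gauss(u, v, rho):
    # Partial derivative of the Gauss copula with respect to its first argument
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    z = (stats.norm.ppf(v) - rho * stats.norm.ppf(u)) / np.sqrt(1.0 - rho**2)
    return stats.norm.cdf(z)


def policy_loss_density(l, mu, delta, lam, rho, y_max=40):
    # Density of L = X * Y as a sum over the number of claims, truncated at y_max
    l = np.atleast_1d(np.asarray(l, dtype=float))
    y = np.arange(1, y_max + 1, dtype=float)[:, None]  # column of claim counts
    x = l[None, :] / y                                 # implied average claim size l / y
    u = stats.gamma.cdf(x, a=1.0 / delta, scale=mu * delta)
    cond = d1_gauss(u, ztp_cdf(y, lam), rho) - d1_gauss(u, ztp_cdf(y - 1, lam), rho)
    terms = cond * stats.gamma.pdf(x, a=1.0 / delta, scale=mu * delta) / y
    return terms.sum(axis=0)


step = 5.0
grid = np.arange(step, 30000.0 + step, step)
dens = policy_loss_density(grid, mu=1000.0, delta=0.09, lam=2.5, rho=0.0)
mass = dens.sum() * step            # should be close to 1
mean = (grid * dens).sum() * step   # expected policy loss
cdf = np.cumsum(dens) * step
q25, q75 = grid[np.searchsorted(cdf, 0.25)], grid[np.searchsorted(cdf, 0.75)]
print(mass, mean, q25, q75)
```

In the independence case (`rho = 0`) the mean has the closed form
$\mu \lambda / (1 - e^{-\lambda})$, roughly 2723.56 Euro for the Example 4 settings, which gives a
direct check of the numerical integration.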

In a first step, we visualize the densities for a given set of parameters in order to 
investigate the differences between the four copula types and the degree of dependence 
between the average claim size and the average number of claims. In the simulation study 
(Section 6) and the case study (Section 7), we show that in the context of regression, the
copula-based model leads to a more precise estimation of the policy loss compared to the 
independence assumption. 

We continue Example 4 and use the same parameter settings for the marginal
distributions. Figure 2 displays the density of the policy loss for all four copula families and
for three different values of Kendall's $\tau$ equal to 0.1, 0.3 and 0.5. First, we observe that the
distribution is in general skewed, with most of its mass on the left and a long right tail.
Further, we observe that the theoretical densities tend to be multimodal, and the multiple
modes become more distinct for increasing values of Kendall's $\tau$. The skewness and
multi-modality can be readily explained by Theorem 6. Setting

$$\omega(y,l|\mu,\delta,\lambda,\theta) := \frac{1}{y} P\left( Y = y \,\Big|\, X = \frac{l}{y}, \mu,\delta,\lambda,\theta \right),$$

the density of the policy loss can be written as an infinite mixture of Gamma distributions

$$f_L(l|\mu,\delta,\lambda,\theta) = \sum_{y=1}^{\infty} \omega(y,l|\mu,\delta,\lambda,\theta) \cdot f_X\left( \frac{l}{y} \,\Big|\, \mu,\delta \right).$$

As the individual Gamma densities are skewed, the density of the mixture tends to be
skewed, too. Moreover, a mixture of unimodal Gamma densities can be multi-modal as
well. The parameter settings of the model influence the number of the modes and how
pronounced they are.



Figure 3 displays the expectation, the 25%-quantile $Q_{0.25;L}$ and the 75%-quantile $Q_{0.75;L}$
of the policy loss as a function of Kendall's $\tau$. All three quantities are evaluated using
numerical integration and numerical root solvers. We use the parameter settings of
Example 4 for the marginal distributions. The solid and dotted lines indicate the mean and
the quantiles if we assume that average claim sizes and number of claims are independent.
We observe that the independence assumption leads to an overestimation of the policy
loss if average claim sizes and number of claims have a negative monotone association (i.e.
$\tau < 0$), and it leads to an underestimation if $\tau > 0$. As an example, we compare the
expected policy loss under independence (which equals 2723 Euro) to the expected policy loss
for $\tau = 0.2$. We obtain 2860 Euro (+5%) for the Gauss copula, 2837 Euro (+4%) for the
Clayton copula, 2880 Euro (+6%) for the Gumbel copula and 2850 Euro (+5%) for the Frank
copula.

Based on Figures 2 and 3, we observe a strong dependence of the distribution of the
policy loss on the size of Kendall's $\tau$. However, we do not observe a strong dependence on
the choice of the copula family.



Figure 3: Expected policy loss (blue diamonds) and upper and lower quartiles for the
four copula families (panels: Gauss, Clayton, Gumbel and Frank copula), seen as a function
of Kendall's $\tau$. For negative values of Kendall's $\tau$ the Clayton and the Gumbel copula have
been rotated. The parameter settings for the marginal distributions are taken from
Example 4. The grey solid and dotted lines indicate the expected policy loss and upper/lower
quartiles if we assume independence.



4 Copula regression model for average claim sizes and number of claims

In the two previous sections, we modeled average claim sizes and number of claims
independently of possible covariates. In order to incorporate covariates, we use the approach
by Czado et al. (2011). We extend the joint model (6) for average claim sizes $X$ and
number of claims $Y$ by allowing the marginal distributions of $X$ and $Y$ to depend on a set of
covariates. More precisely, we apply generalized linear models for the marginal regression
problems and combine these with bivariate copula families.

4.1 Model formulation

Let $X_i \in \mathbb{R}_+$, $i = 1, 2, \ldots, n$, be independent continuous random variables and let
$Y_i \in \mathbb{N}_{>0}$, $i = 1, 2, \ldots, n$, be independent discrete random variables. We model $X_i$ in
terms of a covariate vector $r_i \in \mathbb{R}^p$ and $Y_i$ in terms of a covariate vector $s_i \in \mathbb{R}^q$. The
marginal regression models are specified via

$$X_i \sim \text{Gamma}(\mu_i, \delta) \quad \text{with } \ln(\mu_i) = r_i^\top \alpha,$$
$$Y_i \sim \text{ZTP}(\lambda_i) \quad \text{with } \ln(\lambda_i) = \ln(e_i) + s_i^\top \beta.$$

Here, $e_i$ denotes the exposure time. We remark that the covariate vectors $r_i$ and $s_i$ can be
distinct.
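
The two log-linear predictors can be written down directly. The following sketch computes the
marginal parameters $\mu_i$ and $\lambda_i$ from design matrices; the function names and the small
numerical example are hypothetical and only illustrate how the exposure enters as an offset.

```python
import numpy as np


def marginal_parameters(R, S, alpha, beta, exposure):
    """Return (mu, lam) for the Gamma and ZTP regression models."""
    mu = np.exp(R @ alpha)              # ln(mu_i) = r_i' alpha
    lam = exposure * np.exp(S @ beta)   # ln(lambda_i) = ln(e_i) + s_i' beta
    return mu, lam


def ztp_mean(lam):
    # Expected number of claims of a zero-truncated Poisson variable
    return lam / (1.0 - np.exp(-lam))


# Two policy groups, intercept plus one dummy covariate (made-up values)
R = S = np.array([[1.0, 0.0], [1.0, 1.0]])
alpha = np.array([np.log(1000.0), 0.1])
beta = np.array([np.log(2.0), -0.5])
exposure = np.array([1.0, 0.5])

mu, lam = marginal_parameters(R, S, alpha, beta, exposure)
print(mu, lam, ztp_mean(lam))
```

Note that $\lambda_i$ is the Poisson parameter before truncation; the expected number of claims
of the ZTP variable is $\lambda_i / (1 - e^{-\lambda_i})$ and therefore slightly larger.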

4.2 Parameter estimation 

We need to estimate the unknown parameter vector

$$\upsilon := (\alpha^\top, \beta^\top, \theta, \delta)^\top \in \mathbb{R}^{p+q+2} \tag{8}$$

based on $n$ observation pairs $(x_i, y_i)$. Here, we use maximum-likelihood estimation
techniques. By definition, the loglikelihood of the model parameters $\upsilon$ is

$$\ell(\upsilon|x,y) = \sum_{i=1}^{n} \ln\left( f_{X,Y}(x_i, y_i|\upsilon) \right) \tag{9}$$

with

$$x = (x_1, \ldots, x_n)^\top \in \mathbb{R}^n \quad \text{and} \quad y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n.$$

The maximum likelihood estimates are given by

$$\hat{\upsilon} = \arg\max_{\upsilon} \ell(\upsilon|x,y).$$

In general, there is no closed-form solution. Therefore, we have to maximize the
loglikelihood numerically. In this paper, we apply the BFGS optimization algorithm (a
quasi-Newton method) to maximize the loglikelihood (9). As the copula parameter $\theta \in \Theta$ is
in general restricted (see Appendix A), we transform it via a function $g : \Theta \to \mathbb{R}$ such that
$g(\theta)$ is unrestricted. As an example, for the Gauss copula, the copula parameter $\theta$ lies in
$]-1, 1[$, and the transformation is defined as
$$g(\theta) = \frac{1}{2} \ln\left( \frac{1+\theta}{1-\theta} \right).$$

We then optimize the loglikelihood with respect to $(\alpha^\top, \beta^\top, g(\theta), \delta)^\top$.

Alternatively, we can estimate the model parameters by applying the
inference-for-margins (IFM) principle (Joe and Xu, 1996). Here, we proceed in two steps. First, we
estimate the marginal regression models for average claim sizes and number of claims via
maximum-likelihood. We obtain estimates

$$\hat{\mu} = \exp(R\hat{\alpha}) \in \mathbb{R}^n,$$
$$\hat{\lambda} = \exp(S\hat{\beta}) \circ e \in \mathbb{R}^n$$

for each observation and an estimate $\hat{\delta} \in \mathbb{R}$ for the dispersion parameter. Here, $e \in \mathbb{R}^n$ is
the vector of exposure times, and $\circ$ denotes an element-wise multiplication of two vectors.
These estimates are used to transform the observations $x$ and $y$ to

$$u_i := F_X(x_i|\hat{\mu}_i, \hat{\delta}) \in [0,1],$$
$$v_i := F_Y(y_i|\hat{\lambda}_i) \in [0,1],$$
$$w_i := F_Y(y_i - 1|\hat{\lambda}_i) \in [0,1].$$

Here, $F_X$ and $F_Y$ are the distribution functions of a Gamma and zero-truncated Poisson
variable respectively. In the second step, we optimize the copula parameter $\theta$ by maximizing
the loglikelihood

$$l(\theta|u,v,w) := \sum_{i=1}^{n} \ln\left( D_1(u_i, v_i|\theta) - D_1(u_i, w_i|\theta) \right).$$

The function $l$ can be maximized numerically. In general, the run-time of the IFM
approach is much shorter than that of the maximization of the loglikelihood (9). In initial
simulations, the performance of the two methods was very similar. This confirms earlier
findings by De Leon and Wu (2011). Therefore, in the simulation study below, we only
report the results of the maximum likelihood solution, since it is asymptotically more
efficient.
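
The second IFM step can be sketched as a one-dimensional optimization. The example below
assumes a Gauss copula and intercept-only margins, simulates data through the latent Gaussian
representation of the copula, and, as a simplification, plugs in the true marginal parameters
as stand-ins for the first-step estimates. All names are hypothetical.

```python
import numpy as np
from scipy import optimize, stats


def ztp_cdf(y, lam):
    # Zero-truncated Poisson distribution function; equals 0 for y < 1
    y = np.asarray(y, dtype=float)
    p0 = np.exp(-lam)
    cdf = (stats.poisson.cdf(y, lam) - p0) / (1.0 - p0)
    return np.where(y < 1, 0.0, np.clip(cdf, 0.0, 1.0))


def d1_gauss(u, v, rho):
    # Partial derivative of the Gauss copula with respect to its first argument
    u = np.clip(u, 1e-12, 1.0 - 1e-12)
    z = (stats.norm.ppf(v) - rho * stats.norm.ppf(u)) / np.sqrt(1.0 - rho**2)
    return stats.norm.cdf(z)


def copula_loglik(rho, u, v, w):
    # Second-step loglikelihood: sum of log differences of D_1 terms
    diff = d1_gauss(u, v, rho) - d1_gauss(u, w, rho)
    return np.sum(np.log(np.maximum(diff, 1e-300)))


# Simulate (x, y) from the Gauss copula model via correlated standard normals
rng = np.random.default_rng(0)
n, mu, delta, lam, rho_true = 2000, 1000.0, 0.09, 2.5, 0.5
z1 = rng.standard_normal(n)
z2 = rho_true * z1 + np.sqrt(1.0 - rho_true**2) * rng.standard_normal(n)
x = stats.gamma.ppf(stats.norm.cdf(z1), a=1.0 / delta, scale=mu * delta)
p0 = np.exp(-lam)
y = np.maximum(stats.poisson.ppf(p0 + (1.0 - p0) * stats.norm.cdf(z2), lam), 1.0)

# Step 1 (stand-in): transform the observations with the marginal distributions
u = stats.gamma.cdf(x, a=1.0 / delta, scale=mu * delta)
v = ztp_cdf(y, lam)
w = ztp_cdf(y - 1, lam)

# Step 2: maximize the copula loglikelihood over the copula parameter
res = optimize.minimize_scalar(
    lambda r: -copula_loglik(r, u, v, w), bounds=(-0.99, 0.99), method="bounded"
)
rho_hat = res.x
print(rho_hat)
```

Because only a single parameter is optimized in the second step, a bounded scalar search is
sufficient here, which is one reason why the two-step procedure tends to be fast.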



Finally, we note that Czado et al. (2011) recently proposed an extension of the
maximization by parts algorithm (Song et al., 2005) to estimate the regression parameters.
These methods could be easily adapted to our model. In this paper, we do not pursue this
approach and estimate the parameters via maximum-likelihood.



4.3 Asymptotic distribution of the regression parameters 

For the construction of approximate confidence intervals, we use the Fisher information
matrix that is defined as

$$I(\upsilon) := E\left[ \frac{\partial \ell(\upsilon|x,y)}{\partial \upsilon} \left( \frac{\partial \ell(\upsilon|x,y)}{\partial \upsilon} \right)^\top \right] \in \mathbb{R}^{(p+q+2) \times (p+q+2)}.$$

Under regularity conditions (see, e.g., Serfling (1980)) one can show that

$$(\hat{\upsilon} - \upsilon) \overset{a}{\sim} \mathcal{N}_{p+q+2}\left( 0, I^{-1}(\upsilon) \right).$$

Here, $\mathcal{N}_k$ denotes a $k$-dimensional multivariate normal distribution. For the estimation of
the Fisher information, we use the fact that (Lehmann and Casella, 1998)

$$I(\upsilon) = -E\left[ \frac{\partial^2 \ell(\upsilon|x,y)}{\partial \upsilon \, \partial \upsilon^\top} \right]$$

and use the observed Fisher information matrix

$$\hat{I}(\hat{\upsilon}) := -\frac{\partial^2 \ell(\upsilon|x,y)}{\partial \upsilon \, \partial \upsilon^\top} \bigg|_{\upsilon = \hat{\upsilon}}.$$

This is the negative Hessian matrix of the loglikelihood function, evaluated at the maximum
likelihood estimate. In our case, it is feasible to compute the second partial derivatives
explicitly. Moreover, the BFGS optimization algorithm returns an approximation of the
Hessian matrix that is obtained via numerical derivatives. In this paper, we use this
approximation to estimate standard errors for the regression coefficients.
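
A minimal sketch of this workflow is given below, assuming a Gauss copula, intercept-only
margins and SciPy's BFGS implementation, whose result object exposes an approximate inverse
Hessian as `hess_inv`. The reparametrization (logarithms for the positive parameters, the
inverse hyperbolic tangent for the copula parameter), the simulated data and all names are
assumptions made for illustration. The resulting standard errors refer to the transformed
parameters.

```python
import numpy as np
from scipy import optimize, stats


def ztp_cdf(y, lam):
    # Zero-truncated Poisson distribution function; equals 0 for y < 1
    p0 = np.exp(-lam)
    cdf = (stats.poisson.cdf(y, lam) - p0) / (1.0 - p0)
    return np.where(y < 1, 0.0, np.clip(cdf, 0.0, 1.0))


def d1_gauss(u, v, rho):
    # Partial derivative of the Gauss copula with respect to its first argument
    z = (stats.norm.ppf(v) - rho * stats.norm.ppf(u)) / np.sqrt(1.0 - rho**2)
    return stats.norm.cdf(z)


def negloglik(par, x, y):
    # par = (ln mu, ln lambda, artanh(rho), ln delta), all unrestricted
    mu, lam, delta = np.exp(par[0]), np.exp(par[1]), np.exp(par[3])
    rho = np.tanh(par[2])
    shape, scale = 1.0 / delta, mu * delta
    u = np.clip(stats.gamma.cdf(x, a=shape, scale=scale), 1e-12, 1.0 - 1e-12)
    diff = d1_gauss(u, ztp_cdf(y, lam), rho) - d1_gauss(u, ztp_cdf(y - 1, lam), rho)
    logf = stats.gamma.logpdf(x, a=shape, scale=scale) + np.log(np.maximum(diff, 1e-300))
    return -np.sum(logf)


# Simulated data from the model (latent Gaussian representation of the copula)
rng = np.random.default_rng(1)
n, mu0, delta0, lam0, rho0 = 1000, 1000.0, 0.09, 2.5, 0.5
z1 = rng.standard_normal(n)
z2 = rho0 * z1 + np.sqrt(1.0 - rho0**2) * rng.standard_normal(n)
x = stats.gamma.ppf(stats.norm.cdf(z1), a=1.0 / delta0, scale=mu0 * delta0)
p0 = np.exp(-lam0)
y = np.maximum(stats.poisson.ppf(p0 + (1.0 - p0) * stats.norm.cdf(z2), lam0), 1.0)

# Moment-based starting values, copula parameter started at independence
start = np.array([np.log(x.mean()), np.log(y.mean()), 0.0, np.log(x.var() / x.mean() ** 2)])
res = optimize.minimize(negloglik, start, args=(x, y), method="BFGS")

# Approximate standard errors from the BFGS inverse Hessian of the negative loglikelihood
se = np.sqrt(np.diag(res.hess_inv))
print(np.exp(res.x[0]), np.exp(res.x[1]), np.tanh(res.x[2]), np.exp(res.x[3]))
print(se)
```

The BFGS inverse Hessian is only a by-product of the optimization and can be rough; when
precise standard errors matter, evaluating the Hessian explicitly or by finite differences at
the optimum is a natural cross-check.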



4.4 Selection of the copula family 

Up to now, the copula class is assumed to be fixed. For the comparison of two copula
families, we propose the likelihood-ratio test for non-nested hypotheses by Vuong (1989).
This test is appropriate as our models are non-nested, i.e. the regression model for one
copula family cannot be obtained via a restriction of the regression model for the other
copula family. Let us denote by $\ell^{(1)}, \ell^{(2)} \in \mathbb{R}^n$ the vectors of pointwise loglikelihoods for
a model with copula family 1 and 2 respectively. Here, we assume that both models have
the same degrees of freedom, i.e. the same number of parameters. We now compute the
differences of the pointwise loglikelihoods as

$$m_i := \ell_i^{(1)} - \ell_i^{(2)}, \quad i = 1, \ldots, n.$$

Denote by

$$\bar{m} = \frac{1}{n} \sum_{i=1}^{n} m_i$$

the mean of the differences. Under the null hypothesis that both models fit equally well,
the test statistic

$$T_V := \frac{\sqrt{n} \, \bar{m}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (m_i - \bar{m})^2}} \tag{10}$$

is asymptotically normally distributed with zero mean and unit variance. Hence, we prefer
copula family 1 to copula family 2 at level $\alpha$ if

$$T_V > \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right),$$

where $\Phi$ denotes the standard normal distribution function. If

$$T_V < \Phi^{-1}\left( \frac{\alpha}{2} \right),$$

we prefer copula family 2. Otherwise, no decision among the two copula families is possible.
We note that it is possible to adjust the test if the two models have different degrees of
freedom.
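
The test procedure amounts to a few lines of code. The sketch below takes two vectors of
pointwise loglikelihoods and returns the statistic together with the decision; the function
name and the use of the sample standard deviation with divisor $n-1$ are assumptions (the
choice of divisor does not matter asymptotically).

```python
import math

import numpy as np
from scipy import stats


def vuong_test(loglik1, loglik2, alpha=0.05):
    """Vuong test for two non-nested models with equal numbers of parameters.

    Returns (statistic, decision) with decision 1 or 2 for the preferred model
    and 0 if no decision is possible at level alpha.
    """
    m = np.asarray(loglik1, dtype=float) - np.asarray(loglik2, dtype=float)
    n = m.size
    statistic = math.sqrt(n) * m.mean() / m.std(ddof=1)
    critical = stats.norm.ppf(1.0 - alpha / 2.0)
    if statistic > critical:
        return statistic, 1
    if statistic < -critical:
        return statistic, 2
    return statistic, 0


# Made-up pointwise loglikelihoods: model 1 is better by 0.4 or 0.6 per observation
ll2 = np.full(100, -2.0)
ll1 = ll2 + np.tile([0.4, 0.6], 50)
print(vuong_test(ll1, ll2))
```

Swapping the two arguments flips the sign of the statistic, which matches the symmetric
decision rule stated above.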



5 Estimation of the total loss 

Recall that we model the policy loss $L_i = X_i \cdot Y_i$ for each policy holder via our joint
regression model. Its distribution is determined by Theorem 6. In a next step, we are
interested in the distribution of the total loss over all policy holders.

Definition 7 (Total loss). For $n$ policies with average claim sizes $X_i$ and number of claims
$Y_i$ (for $i = 1, \ldots, n$), the total loss is defined as the sum of the $n$ policy losses

$$T := \sum_{i=1}^{n} L_i = \sum_{i=1}^{n} X_i \cdot Y_i.$$

Just as the individual policy losses, the total loss is a positive, continuous random
variable. An application of the central limit theorem leads to the following result.

Proposition 8 (Asymptotic distribution of the total loss). For $n$ independent policy losses
$L_1, \ldots, L_n$ with mean $\mu_{L_i}$ and variance $\sigma^2_{L_i}$, the asymptotic distribution of the total loss $T$
is normal. For

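$$s_n^2 := \sum_{i=1}^{n} \sigma^2_{L_i}$$

we have

$$\frac{T - \sum_{i=1}^{n} \mu_{L_i}}{s_n} \xrightarrow{d} \mathcal{N}(0, 1).$$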

For the estimation of the total loss, we need to estimate the means $\mu_{L_i}$ and variances
$\sigma^2_{L_i}$ of the individual policy losses. This is done by replacing the distribution parameters
$\mu_i, \delta, \theta, \lambda_i$ of $L_i$ by their estimates obtained from our joint regression model. Then, the
mean and variance can be estimated numerically.
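
Given estimated means and variances of the individual policy losses, the normal approximation
of the total loss is straightforward to evaluate. The sketch below is a minimal illustration;
the function name, the default quantile level and the input values are made up.

```python
import numpy as np
from scipy import stats


def total_loss_normal_approx(means, variances, level=0.995):
    """Normal approximation of the total loss T = sum of independent policy losses.

    Returns the approximate mean, standard deviation and `level`-quantile of T.
    """
    mean_total = float(np.sum(means))
    sd_total = float(np.sqrt(np.sum(variances)))
    quantile = mean_total + sd_total * stats.norm.ppf(level)
    return mean_total, sd_total, quantile


# Four hypothetical policies with mean loss 100 and variance 25 each
print(total_loss_normal_approx([100.0] * 4, [25.0] * 4))
```

The variances add up only because the policy losses are assumed to be independent across
policies; dependence within a policy (between claim size and number of claims) is already
contained in each individual mean and variance.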

                        intercept    age      female    car type B    car type C
average claim size X    -0.50        -0.05    -1.00     +2.00         -0.50
number of claims Y      -1.00        +0.04    +0.30     +0.30         +0.20



Table 2: Regression coefficients for the simulation study. 



6 Simulation study 

We consider a regression problem with $n = 500$ policy groups and the following covariates:
age, gender and type of car (A, B or C). We assume that all policy groups contain the
same number of persons, which leads to a constant offset. The first column of the design
matrices

$$S := R := (r_1, \ldots, r_n)^\top \in \mathbb{R}^{500 \times 5}$$

consists of 1's. This corresponds to marginal regression models with an intercept. The
second column corresponds to the covariate age, and is drawn uniformly from the range
18 to 65. The third column is the dummy variable corresponding to female. Here, the
probability of a female policy group is set to 1/2. The last two columns are the two dummy
variables corresponding to car type B and car type C. Car type A is represented by the
intercept. The probability of a certain car type is set to 1/3. The vectors of regression
coefficients are defined in Table 2. As an example, in this simulation scenario, a female
driver has a negative effect on the average claim size, and a positive effect on the number of
claims. We set the constant dispersion parameter $\delta$ of the Gamma distribution to $\delta = 0.25$,
which implies that the coefficient of variation (CV) fulfills

$$CV = \frac{\sqrt{Var(X_i)}}{E(X_i)} = \sqrt{\delta} = 0.5.$$

We consider the four copula families and three different values of Kendall's $\tau$,

$$\tau = 0.1; 0.3; 0.5.$$

For each parameter setting, we sample $n = 500$ observations from the true copula regression
model, and then fit the regression coefficients and Kendall's $\tau$ via maximum likelihood. We
consider two estimation methods: (1) the independence model, where we fit the two marginal
regression models and set $\tau = 0$, and (2) the joint, copula-based model. We also compute the
estimated loss for each of the $n$ policies. We repeat this procedure $R = 50$ times. To evaluate
the performance of the two approaches, we consider the following measures for the estimated
regression coefficients and the expected policy loss. For a parameter vector $\gamma \in \mathbb{R}^k$ with
estimate $\hat{\gamma}$, we are interested in the relative mean squared error which is defined as

$$MSE_{rel} := \frac{1}{k} E\left[ \sum_{i=1}^{k} \left( \frac{\hat{\gamma}_i - \gamma_i}{\gamma_i} \right)^2 \right]. \tag{11}$$

In the $r$th iteration step, we obtain an estimate of (11) via

$$MSE_{rel}^{(r)} := \frac{1}{k} \sum_{i=1}^{k} \left( \frac{\hat{\gamma}_i^{(r)} - \gamma_i}{\gamma_i} \right)^2.$$

Here, $\hat{\gamma}^{(r)}$ is the estimate of $\gamma$ obtained in the $r$th step. In the simulation study, we compare
the mean relative mean squared error

$$\overline{MSE}_{rel} = \frac{1}{R} \sum_{r=1}^{R} MSE_{rel}^{(r)}$$

computed over all $R$ simulation runs. Note that its variance can be estimated via

$$\hat{\sigma}^2_{MSE_{rel}} = \frac{1}{R-1} \sum_{r=1}^{R} \left( MSE_{rel}^{(r)} - \overline{MSE}_{rel} \right)^2.$$

Further, we investigate the size of the estimated $\tau$, the estimated total loss, and the
value of the Akaike information criterion

$$AIC := -2 \ell(\hat{\upsilon}|x,y) + 2 \, DoF,$$

where the Degrees of Freedom (DoF) are the number of estimated parameters in the model.
Note that we have $p + q + 2 = 12$ Degrees of Freedom for the joint model and $p + q + 1 = 11$
Degrees of Freedom for the independence model. We prefer the model with the lower AIC
score.
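
The evaluation measures can be computed as follows. The sketch uses made-up numbers in place
of actual simulation output; the function names are hypothetical, and the error-bar
half-width follows the "standard deviation divided by the square root of the number of runs"
convention.

```python
import numpy as np


def relative_mse(estimate, truth):
    # Average squared relative deviation over the components of the parameter vector
    estimate, truth = np.asarray(estimate, float), np.asarray(truth, float)
    return float(np.mean(((estimate - truth) / truth) ** 2))


def aic(loglik, dof):
    # Akaike information criterion; lower values are preferred
    return -2.0 * loglik + 2.0 * dof


def summarize_runs(values):
    # Mean over the simulation runs and standard deviation of that mean
    values = np.asarray(values, float)
    return float(values.mean()), float(values.std(ddof=1) / np.sqrt(values.size))


truth = np.array([1.0, 2.0])
runs = [np.array([1.1, 1.8]), np.array([0.9, 2.2]), np.array([1.0, 2.0])]
per_run = [relative_mse(est, truth) for est in runs]
print(per_run, summarize_runs(per_run))
print(aic(-100.0, 12), aic(-100.0, 11))
```

With equal loglikelihood, the model with fewer parameters has the lower AIC, so the joint
model is only preferred if the copula improves the fit by more than one loglikelihood unit.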

Figure 4 displays the results for the Clayton copula. For each quantity that we compute
in each of the $R$ simulation runs, we display the mean over all $R$ runs. The means are
indicated by a square. The width of each error bar equals twice the standard deviation of
the quantity, divided by $\sqrt{R}$.

The upper row in Figure 4 displays the relative mean squared error of $\alpha$, $\beta$, and the
estimated expected policy loss. Overall, we observe that the relative mean squared errors
for the regression parameters (left and center panel) are not significantly different. For
the policy loss however (right panel), the relative mean squared error is lower for the joint,
copula-based model, and this improvement becomes more pronounced for higher values of
$\tau$.

The first column of the second row displays the estimated value of Kendall's $\tau$. Here,
the dashed line indicates the true value of Kendall's $\tau$. We observe that the estimates of
$\tau$ are close to the true values. Moreover, the AIC score (center panel in the second row) of
the joint model is lower than that of the marginal models. This shows that if the joint model
is the true model, then we have to use the joint estimation approach, i.e. the dependence
cannot be ignored. The right lower panel displays the estimated total loss. The dashed horizontal

Figure 4: Results of the simulation study for the Clayton copula. Top row: relative mean
squared error (11) for the coefficients of the average claim size (left), the coefficients of
the number of claims (center) and the policy loss (right). Bottom row: estimated Kendall's $\tau$
(left), AIC score (center) and estimated total loss (right). We display the mean over $R$ runs.
The width of the whiskers is twice the estimated standard deviation of the mean. Whiskers
that are not displayed are too narrow to be visualized.

name     description                     number of categories
gen      driver's gender                 2
rcl      regional class                  8
bonus    no-claims bonus                 7
ded      type of deductible              5
dist     distance driven                 5
age      driver's age                    6
const    construction year of the car    7



Table 3: Covariates in the German car insurance data set. 



lines are the true values of the total loss for the respective value of Kendall's $\tau$. We observe
that the independence model systematically underestimates the total loss. This confirms
the conclusions drawn from Figure 3.

The results for the three other copula families confirm all the findings made for the
Clayton copula. We display the results in Appendix B.



7 Case study: car insurance data 

We consider data provided by a German insurance company. The data set contains car
insurance data for 7663 German insurance policy groups from the year 2000, with seven
covariates and information on the exposure time. All seven covariates are categorical. The
data was previously analyzed by Czado et al. (2011). Details on the covariates are given in
Table 3.



7.1 Marginal models 

We first analyze the marginal models. We fit a Gamma regression model for the average
claim size, and a zero-truncated Poisson regression model for the number of claims. Next,
we investigate the significance of the estimated regression parameters $\alpha$ and $\beta$. We are
interested in those coefficients that are significantly different from 0. Recall (see Section 4.3)
that asymptotically, these estimates are normally distributed, and that we can construct
approximate confidence intervals using the observed Fisher information. In addition, we
adjust the tests for multiple comparisons and the dependence of the estimators (Hothorn
et al., 2008). For the number of claims, the covariates age and construction year do not have
any significant coefficients at a level of $\alpha = 0.05$. We re-fit the marginal models, leaving
out the respective non-significant covariates. Figure 5 displays the joint 95% confidence
intervals of the coefficients, showing that the remaining covariates are significant at the
5% level.
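The construction of approximate confidence intervals from the observed Fisher information can be illustrated on a deliberately simplified case. The following sketch assumes an intercept-only zero-truncated Poisson model with log link, without covariates, exposure offset or multiplicity adjustment; the function names and the claim counts are hypothetical and are not taken from the data set or from the authors' implementation.

```python
import math
from statistics import NormalDist


def _score_and_info(eta, n, total):
    """Score and observed information of the log-likelihood in eta = log(lambda)."""
    lam = math.exp(eta)
    p_pos = -math.expm1(-lam)  # P(Y > 0) = 1 - exp(-lambda)
    score = total - n * lam / p_pos
    info = n * lam * (p_pos - lam * math.exp(-lam)) / p_pos ** 2
    return score, info


def fit_ztp_intercept(counts, tol=1e-12, max_iter=50):
    """Newton iterations for the ML estimate of eta in an intercept-only model."""
    n, total = len(counts), sum(counts)
    if min(counts) < 1:
        raise ValueError("zero-truncated counts must be >= 1")
    if total == n:
        raise ValueError("all counts equal 1: the ML estimate lies on the boundary")
    eta = math.log(total / n)  # start at the untruncated Poisson estimate
    for _ in range(max_iter):
        score, info = _score_and_info(eta, n, total)
        step = score / info
        eta += step
        if abs(step) < tol:
            break
    _, info = _score_and_info(eta, n, total)
    return eta, info


def wald_interval(estimate, info, level=0.95):
    """Approximate interval: estimate +/- z * sqrt(1 / observed information)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = math.sqrt(1.0 / info)
    return estimate - z * se, estimate + z * se


# Hypothetical positive claim counts, for illustration only.
counts = [1, 1, 2, 1, 3, 1, 2, 1, 1, 4, 2, 1]
eta_hat, info = fit_ztp_intercept(counts)
lower, upper = wald_interval(eta_hat, info)
print("lambda:", math.exp(eta_hat))
print("95% interval for lambda:", math.exp(lower), math.exp(upper))
```

A coefficient would be called significant at the 5% level if its interval excludes 0. The intervals in this sketch are unadjusted, whereas the analysis above uses joint intervals that account for multiple comparisons.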
Figure 5: Marginal regression models. Joint 95% confidence intervals for the regression
coefficients. Left: average claim size. Right: number of claims.



7.2 Joint model 



We use the covariates selected in Section 7.1 and fit the joint regression model for each
copula family. For each pair of copula families, we perform a corresponding Vuong test.
Table 4 displays the results. For each pair, we display the copula family that is selected
at the $\alpha = 0.05$ level. In parentheses, we display the value of the Vuong test statistic (10).
Note that a value > 2 indicates that we select model 1, and that a value < -2 indicates
that we select model 2. We conclude that the Clayton copula is preferred to each of the
three other copula families. Therefore, for the remainder of this section, we continue our
analysis with the Clayton copula family.

                                  model 2
model 1    Gauss              Clayton            Gumbel             Frank

Gauss      -                  Clayton (-10.37)   Gauss (+6.11)      Frank (-5.34)
Clayton    Clayton (+10.37)   -                  Clayton (+9.23)    Clayton (+9.54)
Gumbel     Gauss (-6.11)      Clayton (-9.23)    -                  Frank (-6.54)
Frank      Frank (+5.34)      Clayton (-9.54)    Frank (+6.54)      -

Table 4: Pairwise Vuong tests. We display the copula family that is selected at the
$\alpha = 0.05$ level. In parentheses, we display the value of the Vuong test statistic (10).
A value > 2 indicates that we select model 1, and a value < -2 indicates that we select model 2.

The AIC scores for the Clayton model and the independence
model are

$$\mathrm{AIC}_{\text{Clayton}} = 46\,682.35,$$
$$\mathrm{AIC}_{\text{independence}} = 46\,921.67.$$

In terms of model comparison, this implies that the copula-based model is more appropriate
than the independence model. The estimated value of Kendall's $\tau$ ± its estimated standard
deviation equals

$$\hat{\tau}_{\text{Clayton}} = 0.268 \pm 0.098,$$

which corresponds to a moderate, positive dependence between average claim sizes and
number of claims. As a comparison, we note that the estimated value of Kendall's $\tau$ for
the Gauss copula equals 0.157, which implies that the selection of the copula family has a
considerable effect on the estimation of the dependence parameter. Finally, we investigate
the impact of this dependence on the estimation of the total loss. For the copula and
independence model respectively, we obtain an estimated total loss ± its estimated standard
deviation of

$$\hat{E}_{\text{Clayton}}(T) = 81\,751.07 \pm 1239.766,$$
$$\hat{E}_{\text{independence}}(T) = 76\,324.45 \pm 1103.301.$$

As already illustrated in the simulation study, neglecting the dependency structure
leads to considerably lower estimates of the total loss. In our case, this corresponds to a
ratio of

$$\frac{\hat{E}_{\text{independence}}(T)}{\hat{E}_{\text{Clayton}}(T)} = 0.934,$$

which indicates a possibly severe underestimation by the independence model in the presence of
frequency-severity dependence. The more conservative estimate by the copula-based model
takes this dependence into account and will thus result in a more appropriate premium
rating, protecting the insurance company from huge losses in the portfolio.
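The pairwise comparisons in this section rely on a Vuong test with the decision thresholds 2 and -2. As a minimal sketch, the code below computes the standard, uncorrected Vuong statistic from pointwise log-likelihoods and applies that decision rule. It is an assumption that this matches the statistic (10) of the paper, which may include a correction term or a different variance normalization; the function names and the log-likelihood values are hypothetical.

```python
import math
from statistics import mean, stdev


def vuong_statistic(loglik_1, loglik_2):
    """Uncorrected Vuong statistic: sqrt(n) * mean(m) / sd(m), m_i = l1_i - l2_i.

    Uses the sample standard deviation (n - 1 denominator), an assumption
    that is asymptotically equivalent to the 1/n version.
    """
    if len(loglik_1) != len(loglik_2):
        raise ValueError("both models must be evaluated on the same observations")
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    return math.sqrt(len(diffs)) * mean(diffs) / stdev(diffs)


def select_model(statistic, threshold=2.0):
    """Decision rule from the text: > 2 selects model 1, < -2 selects model 2."""
    if statistic > threshold:
        return "model 1"
    if statistic < -threshold:
        return "model 2"
    return "undecided"


# Hypothetical pointwise log-likelihoods of two fitted models.
ll_model_1 = [-1.0, -1.2, -0.8, -1.1]
ll_model_2 = [-1.5, -1.4, -1.3, -1.9]

nu = vuong_statistic(ll_model_1, ll_model_2)
print(nu, select_model(nu))
# Swapping the roles of the two models flips the sign of the statistic.
print(vuong_statistic(ll_model_2, ll_model_1))
```

The sign flip under swapping the two models is the reason why each pair of copula families appears twice in Table 4 with opposite signs.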



8 Summary and discussion 

In this paper, we model average claim sizes and number of claims when both quantities are
dependent. We provide exact distributions of individual policy losses, which tend to be
left-skewed and, depending on the parameters of the model, can be multi-modal. Further,
we propose a regression approach that models average claim sizes and number of claims in
terms of a set of covariates. We showed theoretically and empirically that the explicit
incorporation of the dependency in terms of copulae has a profound impact on the estimation
of the individual policy loss and the total loss.
Our model explicitly incorporates the discrete structure of the number of claims and
allows a flexible class of copula families. This extends previous work that only considers
the Gauss copula (Czado et al., 2011; De Leon and Wu, 2011). In our case study, we
demonstrated that other copula families are more appropriate.

We stress that our approach does not depend on the particular choice of the marginal 
distribution or copula family, and it can be extended to other parametric distributions 
and families (see e.g. Yee and Wild (1996) and Yee (2010) for an overview of appropriate
marginal distributions). Moreover, in the case of higher-dimensional mixtures of discrete
and continuous random variables, pair-copula constructions (Aas et al., 2009) can be used
(see e.g. Panagiotelis et al. (2012) and Stöber et al. (2012)).

In our simulation study, we showed that a model that assumes independence of average
claim sizes and number of claims consistently underestimates the total loss of the insurance
portfolio, implying a severe mispricing of policies. Knowing the true distribution of the
policy loss and total loss allows us to assess the risk correctly. This is underpinned by our
case study on German car insurance policies. Here, we select relevant covariates for the
marginal models and choose the appropriate copula family for the dependence structure
using a Vuong test. The data shows a moderate positive dependence. We illustrate that
this leads to a more conservative estimation of the total loss, which helps to avoid huge
losses in the insurance portfolio and, possibly, bankruptcy. Respecting actuarial prudence
therefore requires taking into account possible dependencies between average claim sizes
and numbers of claims.
A Copulae

Tables 5 and 6 provide the definition of the four bivariate copula families, their relationship to
Kendall's $\tau$ and their first partial derivative. Here $\Phi_2$ is the cumulative distribution function
of the bivariate standard normal distribution, and $\Phi$ is the cumulative distribution function
of the univariate standard normal distribution. Further,

$$D_k(x) = \frac{k}{x^k} \int_0^x \frac{t^k}{e^t - 1} \, dt$$

denotes the Debye function, which is defined for $k \in \mathbb{N}$.



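family    copula $C(u, v \mid \theta)$                                                                  range of $\theta$             relationship to Kendall's $\tau$

Gauss     $\Phi_2(\Phi^{-1}(u), \Phi^{-1}(v) \mid \theta)$                                              $]-1, 1[$                     $\tau = \frac{2}{\pi} \arcsin(\theta) \in ]-1, 1[$
Clayton   $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta}$                                                 $]0, \infty[$                 $\tau = \frac{\theta}{\theta + 2} \in ]0, 1[$
Gumbel    $\exp(-((-\log u)^\theta + (-\log v)^\theta)^{1/\theta})$                                     $[1, \infty[$                 $\tau = 1 - \frac{1}{\theta} \in [0, 1[$
Frank     $-\frac{1}{\theta} \log(1 + \frac{(e^{-\theta u} - 1)(e^{-\theta v} - 1)}{e^{-\theta} - 1})$  $\mathbb{R} \setminus \{0\}$  $\tau = 1 - \frac{4}{\theta} [1 - D_1(\theta)] \in ]-1, 1[ \setminus \{0\}$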



Table 5: Characteristics of selected copula families. 
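The relationships between the copula parameter and Kendall's $\tau$ can be evaluated numerically. The sketch below implements them with the Python standard library; the function names are hypothetical, and the Debye function $D_1$ is approximated with a simple midpoint rule, which is an implementation choice made here and not the authors' method.

```python
import math


def tau_gauss(theta):
    return 2.0 / math.pi * math.asin(theta)


def tau_clayton(theta):
    return theta / (theta + 2.0)


def tau_gumbel(theta):
    return 1.0 - 1.0 / theta


def debye_1(x, steps=2000):
    """D_1(x) = (1/x) * integral_0^x t / (e^t - 1) dt, midpoint rule.

    The signed step width also covers negative x.
    """
    h = x / steps
    integral = sum(
        (i + 0.5) * h / math.expm1((i + 0.5) * h) for i in range(steps)
    ) * h
    return integral / x


def tau_frank(theta):
    return 1.0 - 4.0 / theta * (1.0 - debye_1(theta))


# Closed-form inverses, available for three of the four families.
def theta_gauss(tau):
    return math.sin(math.pi * tau / 2.0)


def theta_clayton(tau):
    return 2.0 * tau / (1.0 - tau)


def theta_gumbel(tau):
    return 1.0 / (1.0 - tau)


# Copula parameters implied by the estimates of Kendall's tau in Section 7.2.
print("Clayton theta for tau = 0.268:", theta_clayton(0.268))
print("Gauss theta for tau = 0.157:", theta_gauss(0.157))
```

For the Frank copula no closed-form inverse exists, so a value of $\theta$ for a given $\tau$ would have to be found by numerical root finding on `tau_frank`.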



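family    first partial derivative $D_1(u, v \mid \theta)$

Gauss     $\Phi\left(\frac{\Phi^{-1}(v) - \theta \Phi^{-1}(u)}{\sqrt{1 - \theta^2}}\right)$
Clayton   $(u^{-\theta} + v^{-\theta} - 1)^{-1/\theta - 1} u^{-\theta - 1}$
Gumbel    $u^{-1} \exp(-((-\log u)^\theta + (-\log v)^\theta)^{1/\theta}) (1 + (\frac{\log v}{\log u})^\theta)^{1/\theta - 1}$
Frank     $\frac{e^\theta (e^{\theta v} - 1)}{e^{\theta(u+1)} + e^{\theta(v+1)} - e^\theta - e^{\theta(u+v)}}$

Table 6: First partial derivative of selected copula families.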
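The first partial derivatives can be checked against the copula definitions by numerical differentiation with respect to the first argument. The following sketch does this for the Clayton, Gumbel and Frank families; the Gauss family is omitted because it would need a bivariate normal distribution function, which the Python standard library does not provide. All function names, the evaluation point and the parameter values are hypothetical choices for illustration.

```python
import math


def clayton_cdf(u, v, theta):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)


def clayton_d1(u, v, theta):
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta - 1.0) * u ** (-theta - 1.0)


def gumbel_cdf(u, v, theta):
    a = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-a ** (1.0 / theta))


def gumbel_d1(u, v, theta):
    ratio = math.log(v) / math.log(u)  # positive for u, v in ]0, 1[
    return gumbel_cdf(u, v, theta) / u * (1.0 + ratio ** theta) ** (1.0 / theta - 1.0)


def frank_cdf(u, v, theta):
    num = math.expm1(-theta * u) * math.expm1(-theta * v)
    return -math.log1p(num / math.expm1(-theta)) / theta


def frank_d1(u, v, theta):
    e = math.exp
    num = e(theta) * (e(theta * v) - 1.0)
    den = e(theta * (u + 1.0)) + e(theta * (v + 1.0)) - e(theta) - e(theta * (u + v))
    return num / den


def numeric_d1(cdf, u, v, theta, h=1e-6):
    """Central finite difference of the copula in its first argument."""
    return (cdf(u + h, v, theta) - cdf(u - h, v, theta)) / (2.0 * h)


checks = [
    ("Clayton", clayton_cdf, clayton_d1, 0.732),
    ("Gumbel", gumbel_cdf, gumbel_d1, 1.5),
    ("Frank", frank_cdf, frank_d1, 2.0),
    ("Frank", frank_cdf, frank_d1, -2.0),
]
for name, cdf, d1, theta in checks:
    print(name, theta, d1(0.3, 0.6, theta), numeric_d1(cdf, 0.3, 0.6, theta))
```

Each derivative equals the conditional probability $P(V \le v \mid U = u)$, so the printed values lie between 0 and 1.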
B Results of the simulation study

We display the results for the Gauss copula (Figure 6), the Gumbel copula (Figure 7) and
the Frank copula (Figure 8).
Figure 6: Results of the simulation study for the Gauss copula. Top row: relative mean
squared error (11) for the average claim size (left), the number of claims (center) and
the policy loss (right). Bottom row: estimated Kendall's $\tau$ (left), AIC score (center) and
estimated total loss (right). We display the mean over R runs. The width of the whiskers
is twice the estimated standard deviation of the mean. Whiskers that are not displayed
are too narrow to be visualized.
[Figure 7 panel titles: coefficients claim size (Gumbel); coefficients number of claims (Gumbel); policy loss (Gumbel)]

Figure 7: Results of the simulation study for the Gumbel copula. Top row: relative mean
squared error (11) for the average claim size (left), the number of claims (center) and
the policy loss (right). Bottom row: estimated Kendall's $\tau$ (left), AIC score (center) and
estimated total loss (right). We display the mean over R runs. The width of the whiskers
is twice the estimated standard deviation of the mean. Whiskers that are not displayed
are too narrow to be visualized.
[Figure 8 panel titles: coefficients claim size (Frank); policy loss (Frank)]

Figure 8: Results of the simulation study for the Frank copula. Top row: relative mean
squared error (11) for the average claim size (left), the number of claims (center) and
the policy loss (right). Bottom row: estimated Kendall's $\tau$ (left), AIC score (center) and
estimated total loss (right). We display the mean over R runs. The width of the whiskers
is twice the estimated standard deviation of the mean. Whiskers that are not displayed
are too narrow to be visualized.